LS4003 R tutorial 2
Correlations in R
In this tutorial we’re going to use R to plot correlations, including a line of best fit, the R value and the p value.
See example below for the end point:
Make sure you’ve completed the tutorial 1 section on using R from Excel before starting here.
Install and Set-up
A refresher for how to install and set up R and RStudio.
To get set up, follow the steps below.
Posit Cloud is an online, cloud-based option. It’s a bit more limited than running R on a university computer or your own computer, but the free option should be enough for this module.
Go to Posit Cloud and create a free account
Log in, then go to New Project -> New RStudio Project.
Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.
Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.
To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).
When installing, click the most appropriate option for your machine (Windows/Mac/Linux)
Once you have installed both, open RStudio.
Navigate to your Documents folder in the bottom right panel. (If you can’t find it, type setwd("~/Documents") into the console on the bottom left, then click the More cog on the bottom right and select “Go to Working Directory”.)
Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.
Click on your folder (LS4003_Statistics) to enter it.
Set that as your final working directory by clicking on the ‘More’ cog icon again and selecting “Set as Working Directory”.
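If you prefer typing to clicking, the same steps can be done from the console on your own machine. This is only a sketch: the folder name comes from the steps above, but the ~/Documents location is an assumption and may differ on your computer (on Posit Cloud the folder lives inside your project instead).

```r
# Folder name from the steps above; the ~/Documents location is an assumption
project_dir <- file.path("~", "Documents", "LS4003_Statistics")

# Create the folder if it does not exist yet (does nothing if it is already there)
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Make it the working directory, then check where R is now looking for files
setwd(project_dir)
getwd()
list.files()
```

getwd() should print a path ending in LS4003_Statistics, and list.files() shows what is currently in that folder.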
Import data from CSV file
The dataset we are going to use is heart.csv which you can find on the Canvas page.
This dataset contains various metrics relating to heart health.
| column | data |
|---|---|
| age | Age of patient in years |
| sex | Gender of the patient (F/M) |
| restbp | Resting blood pressure (mmHg) |
| chol | Serum cholesterol (mg/dl) |
| maxheartrate | Maximum heart rate achieved |
Which of these columns contain categorical data and which contain continuous data?
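One possible way to load the file and check its columns is sketched below. It assumes that heart.csv has been downloaded from Canvas into your working directory, and it stores the data under the name heart_df, which is the dataframe name used in the rest of this tutorial.

```r
# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

head(heart_df)     # the first six rows
str(heart_df)      # each column name and the type R has given it
summary(heart_df)  # a quick summary of every column
```

The output of str() is a good place to start when deciding which columns are categorical and which are continuous.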
Plotting age versus resting blood pressure
We could speculate that resting blood pressure may increase with age. To plot this, we can use the ggplot function geom_point.
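A minimal sketch of that scatter plot is shown here. It assumes the ggplot2 package is installed and that heart.csv is in your working directory; the plot name age_bp_plot and the axis labels are our own choices for illustration.

```r
library(ggplot2)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

# Age on the x axis, resting blood pressure on the y axis, one point per patient
age_bp_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  labs(x = "Age (years)", y = "Resting blood pressure (mmHg)")

age_bp_plot
```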
Calculating Pearson’s coefficient
We can use two R functions to calculate our R and p values using a Pearson’s coefficient.
Using cor()
Our first example uses cor(). This gives us the R value - how strong is the correlation?
Let’s break that down:
cor(heart_df$age, heart_df$restbp, method = 'pearson')
- cor(): This is our function name. This is a built in function.
- heart_df$age: This is going to our dataframe called heart_df and extracting the column of values under the column name “age”
- heart_df$restbp: This is going to our dataframe called heart_df and extracting the column of values under the column name “restbp”
- method = 'pearson'
  - This is a parameter, which is an option to choose how we want the function to work.
  - We can set this to either 'pearson' or 'spearman' depending on which test we want to use.
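The call broken down above can be run as in the sketch below, which also shows the 'spearman' option for comparison. It assumes heart.csv is in your working directory; the names pearson_r and spearman_rho are our own.

```r
# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

# Pearson's correlation coefficient between age and resting blood pressure
pearson_r <- cor(heart_df$age, heart_df$restbp, method = "pearson")

# The same pair of columns using Spearman's rank correlation instead
spearman_rho <- cor(heart_df$age, heart_df$restbp, method = "spearman")

pearson_r
spearman_rho
```

Each call returns a single number between -1 and 1. If either column contained missing values, cor() would return NA unless a use argument such as use = "complete.obs" were added.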
Using cor.test()
Our second example uses cor.test(). This gives us the R value again, and also the p value - how likely is a correlation this strong if there were really no relationship?
As you can see, that works very similarly to our first example, except for the last two lines.
cor.test() gives us a list of 9 values, but there are only two we are interested in:
- p.value is our probability p value
- estimate is our correlation coefficient (R value, same as above)
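A sketch of how those two values can be pulled out of the result is shown here. It assumes heart.csv is in your working directory; the name age_bp_test is our own.

```r
# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

# Run the test and store the whole result so we can look inside it
age_bp_test <- cor.test(heart_df$age, heart_df$restbp, method = "pearson")

names(age_bp_test)    # everything cor.test() returns
age_bp_test$p.value   # the p value
age_bp_test$estimate  # the correlation coefficient (R value)
```

Typing age_bp_test on its own prints a readable summary of the whole test.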
Annotate the R and p values onto the scatter graph
To annotate our R and p values onto a scatter graph, we can use the stat_cor function from the ggpubr package.
If you’ve not already installed it, make sure that first you run:
install.packages('ggpubr')
The only parameter we used here for stat_cor was method = 'pearson' so that it would report a Pearson’s correlation.
Do the R and p values annotated on this graph match the results from cor.test()?
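A minimal sketch of the annotated scatter graph, assuming ggplot2 and ggpubr are both installed and heart.csv is in your working directory (the plot name annotated_plot is our own):

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

# stat_cor() calculates the correlation and writes R and p onto the plot
annotated_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  stat_cor(method = "pearson")

annotated_plot
```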
Moving the R and p values
We can use parameters label.x and label.y to add co-ordinates for where we want the R and p value annotations to go.
Try it below:
If you’d like the R and p values to each be on their own line, we can add label.sep = '\n', which separates them with a newline character (\n).
What happens if you add label.sep = 'HELLO'?
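The sketch below puts label.x, label.y and label.sep together. The co-ordinates 30 and 190 are placeholder values chosen for illustration, given in the units of each axis (years and mmHg), so you may need to change them to suit the data. It assumes ggplot2 and ggpubr are installed and heart.csv is in your working directory.

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

moved_label_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  stat_cor(
    method = "pearson",
    label.x = 30,     # placeholder x co-ordinate (age in years)
    label.y = 190,    # placeholder y co-ordinate (blood pressure in mmHg)
    label.sep = "\n"  # put R and p on separate lines
  )

moved_label_plot
```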
Add a regression line
We can also use the function geom_smooth to fit and plot a linear regression model (lm) on our graph.
We use formula = y ~ x to define y as the outcome variable and x as the predictor: here we are asking whether older age (x axis, predictor) predicts higher resting blood pressure (y axis, predicted outcome).
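One way this could look in code is sketched below, assuming ggplot2 and ggpubr are installed and heart.csv is in your working directory (the plot name regression_plot is our own):

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

regression_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  # Fit a straight line: restbp (y) predicted from age (x)
  geom_smooth(method = "lm", formula = y ~ x) +
  stat_cor(method = "pearson")

regression_plot
```

By default geom_smooth also draws a shaded confidence band around the line; adding se = FALSE inside geom_smooth() removes it.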
Separate by categorical group (sex)
If we add a colour based on a categorical variable, ggplot will then fit the regression separately for each group.
See below, where we have assigned each point a colour based on sex:
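A sketch of the grouped version, assuming ggplot2 and ggpubr are installed and heart.csv is in your working directory (the plot name by_sex_plot is our own). Because colour = sex is set inside ggplot(), every layer inherits the grouping, so the regression line and the R and p values are each calculated once per sex.

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

by_sex_plot <- ggplot(heart_df, aes(x = age, y = restbp, colour = sex)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +  # one line per sex
  stat_cor(method = "pearson")                   # one R and p value per sex

by_sex_plot
```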
Correlograms
Our dataset contained more than just age and blood pressure - it would be useful to see if there are any correlations between e.g. blood pressure and maximum heart rate, or cholesterol and age.
We can also use R to do an all-against-all correlation analysis, which gives us a quick idea of whether there are any correlations without repeating all of the above for every pair of columns.
This uses the corrplot library. You might need to run install.packages('corrplot') before you can run this code.
Let’s break that down:
- library(corrplot) is loading the corrplot package, which has the functions we need
- heart_df_only_numerical <- heart_df[,-2]
  - Our sex column in this dataframe is categorical (M or F) so we need to remove it
  - heart_df[,-2] is selecting the whole of the heart_df dataframe except column 2
- heart_corrplot_matrix <- cor(heart_df_only_numerical)
  - This uses the cor() function to calculate all-against-all correlations
- corrplot(heart_corrplot_matrix)
  - This takes our all-against-all correlations and plots a correlogram
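Put together, the steps above can be sketched as follows. This assumes corrplot is installed, heart.csv is in your working directory, sex is the second column, and every other column is numeric.

```r
library(corrplot)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")

# Drop column 2 (sex), which is categorical, so only numeric columns remain
heart_df_only_numerical <- heart_df[, -2]

# All-against-all correlation matrix (Pearson by default)
heart_corrplot_matrix <- cor(heart_df_only_numerical)

# Plot the matrix as a correlogram
corrplot(heart_corrplot_matrix)
```

Printing heart_corrplot_matrix in the console shows the same correlations as a table of numbers.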
Corrplot also has different settings you can use.
To plot the strength of the correlation (R value) numerically, try: corrplot(heart_corrplot_matrix, method = "number")
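And if you want to see which correlations are statistically significant at the 5% level, corrplot also needs a matrix of p values (p.mat), which cor.mtest() from the same package can calculate. Correlations with a p value above sig.level are then crossed out:
corrplot(heart_corrplot_matrix, p.mat = cor.mtest(heart_df_only_numerical)$p, sig.level = 0.05)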
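corrplot only applies sig.level when it is also given a matrix of p values through p.mat. The sketch below calculates that matrix with cor.mtest() from the corrplot package, then uses insig = "blank" to leave non-significant squares empty. It assumes heart.csv is in your working directory and sex is the second column; the name heart_p_values is our own.

```r
library(corrplot)

# Assumes heart.csv (from Canvas) is in your working directory
heart_df <- read.csv("heart.csv")
heart_df_only_numerical <- heart_df[, -2]
heart_corrplot_matrix <- cor(heart_df_only_numerical)

# A p value for every pair of columns, in the same layout as the correlation matrix
heart_p_values <- cor.mtest(heart_df_only_numerical)$p

# Leave blank any correlation with p above 0.05
corrplot(
  heart_corrplot_matrix,
  p.mat = heart_p_values,
  sig.level = 0.05,
  insig = "blank"
)
```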
That’s the end of the tutorial - now move on to Worksheet 2.